{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "OpenImages V4\n",
    "=============\n",
    "\n",
    "https://storage.googleapis.com/openimages/web/download_v4.html\n",
    "\n",
    "The full set is 9,178,275 image URLs (mostly on Flickr, and some are no longer available).\n",
    "\n",
    "#### Target Set\n",
    "Of the full dataset, 1,743,042 images are annotated and hosted by the [CVDF](https://github.com/cvdfoundation/open-images-dataset). They are labeled as \"training data\", but our encoder was not trained on these images; instead, we held them out to use as a target database for our simulations and wetlab experiments.\n",
    "\n",
    "#### Validation Set\n",
    "The CVDF also hosts a validation set consisting of 41,620 images, which we used to track the performance of the encoder during training.\n",
    "\n",
    "#### Training Set\n",
    "To build our encoder training set, we subtracted the target set from the full dataset, and downloaded whatever was available out of the first 1,200,000 remaining images.\n",
    "\n",
    "#### Extended Target Set\n",
    "The remainder of the full dataset (about 6 million image URLs) is used as an extended target database for performing scalability simulations.\n",
    "\n",
    "### Download Target and Validation Sets\n",
    "\n",
    "Run the following code to download the target set (513 gigabytes) and validation set (12 gigagbytes). It will take some time:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!aws s3 --no-sign-request sync \\\n",
    "    s3://open-images-dataset/tar/ \\\n",
    "    /tf/open_images/targets/images/ \\\n",
    "    --exclude \"*\" --include \"train_*.tar.gz\"\n",
    "    \n",
    "!aws s3 --no-sign-request sync \\\n",
    "    s3://open-images-dataset/tar/ \\\n",
    "    /tf/open_images/validation/images/ \\\n",
    "    --exclude \"*\" --include \"validation.tar.gz\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This code will convert the `.tar.gz` files to `.zip` files (which will make accessing the images easier). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import glob\n",
    "tgzs = glob.glob('/tf/open_images/targets/images/*.tar.gz')\n",
    "!cd /tf/open_images/targets/images/\n",
    "for tgz in tgzs:\n",
    "    dname = tgz.replace('.tar.gz', '')\n",
    "    !tar -xf {tgz}\n",
    "    !zip -rq {dname + '.zip'} {dname}\n",
    "    !rm -rf {dname} {tgz}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Download Metadata \n",
    "The images used to train the encoder were taken from the full, un-annotated dataset. Use this code to download the image IDs and URLs for the full dataset (3.1 gigabytes), and the annotated dataset (609 megabytes)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!wget -c -P /tf/open_images/metadata/ 'https://storage.googleapis.com/openimages/2018_04/image_ids_and_rotation.csv'\n",
    "!wget -c -P /tf/open_images/metadata/ 'https://storage.googleapis.com/openimages/2018_04/train/train-images-boxable-with-rotation.csv'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Assemble Encoder Training Set and Extended Target Set\n",
    "This code will get the URLs of images that are not in the original target set, to be used for training the encoder (and for additional targets):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "import hashlib\n",
    "import os, sys\n",
    "import pandas as pd\n",
    "from PIL import Image\n",
    "from io import BytesIO\n",
    "from multiprocessing import Pool"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def mkdirs(subset):\n",
    "    if not os.path.exists('/tf/open_images/%s' % subset):\n",
    "        os.mkdir('/tf/open_images/%s' % subset)\n",
    "        os.mkdir('/tf/open_images/%s/images' % subset)\n",
    "\n",
    "    for i in range(256):\n",
    "        path = '/tf/open_images/%s/images/%02x' % (subset, i)\n",
    "        if not os.path.exists(path):\n",
    "            os.mkdir(path)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "full_set = pd.read_csv('/tf/open_images/metadata/image_ids_and_rotation.csv').set_index(\"ImageID\")\n",
    "target_set = pd.read_csv('/tf/open_images/metadata/train-images-boxable-with-rotation.csv').set_index(\"ImageID\")\n",
    "unused = full_set[(full_set.Subset == 'train') & ~full_set.index.isin(target_set.index)]\n",
    "train_set = unused[:1200000]\n",
    "extended_target_set = unused[1200000:]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Download Encoder Training Set\n",
    "This code will download, resize, and save images from the training set. It will attempt to download 1,200,000 images. The final number will be less because some URLs point to images that are no longer available.\n",
    "\n",
    "**Warning**: This will probably take at least a full day to complete."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mkdirs('train')\n",
    "subset = 'train'\n",
    "def download((img_id, img_meta)):\n",
    "    resp = requests.get(img_meta.OriginalURL)\n",
    "    \n",
    "    img_data = resp.content\n",
    "    md5 = hashlib.md5(img_data).digest().encode(\"base64\").strip()\n",
    "    \n",
    "    if md5 != img_meta.OriginalMD5:\n",
    "        return False\n",
    "    \n",
    "    image = Image.open(BytesIO(img_data))\n",
    "    image.thumbnail([1024,1024])\n",
    "    \n",
    "    img_prefix = img_id[:2]\n",
    "    filename = '/tf/open_images/%s/images/%s/%s.jpg' % (subset, img_prefix, img_id)\n",
    "    image.save(filename)\n",
    "        \n",
    "    return True"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "step = 100000\n",
    "pool = Pool()\n",
    "try:\n",
    "    for start in range(0, len(train_set), step):\n",
    "        print start\n",
    "        download_set = train_set[[\"OriginalURL\",\"OriginalMD5\"]][start : start+step].iterrows()\n",
    "        checks = pool.map(download, download_set)\n",
    "finally:\n",
    "    pool.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Download Extended Target Set\n",
    "This code will download, resize, and save images from the extended target set. It will attempt to download 6,068,177 images. The final number will be less because some URLs point to images that are no longer available.\n",
    "\n",
    "**Warning**: This will probably take at least several days to complete."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mkdirs('extended_targets')\n",
    "subset = 'extended_targets'\n",
    "def download((img_id, img_meta)):\n",
    "    resp = requests.get(img_meta.OriginalURL)\n",
    "    \n",
    "    img_data = resp.content\n",
    "    md5 = hashlib.md5(img_data).digest().encode(\"base64\").strip()\n",
    "    \n",
    "    if md5 != img_meta.OriginalMD5:\n",
    "        return False\n",
    "    \n",
    "    image = Image.open(BytesIO(img_data))\n",
    "    image.thumbnail([1024,1024])\n",
    "    \n",
    "    img_prefix = img_id[:2]\n",
    "    filename = '/tf/open_images/%s/images/%s/%s.jpg' % (subset, img_prefix, img_id)\n",
    "    image.save(filename)\n",
    "        \n",
    "    return True"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "step = 100000\n",
    "pool = Pool()\n",
    "try:\n",
    "    for start in range(0, len(extended_target_set), step):\n",
    "        print start\n",
    "        download_set = extended_target_set[[\"OriginalURL\",\"OriginalMD5\"]][start : start+step].iterrows()\n",
    "        checks = pool.map(download, download_set)\n",
    "finally:\n",
    "    pool.close()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.15+"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
